Genome Research
● Cold Spring Harbor Laboratory
Preprints posted in the last 30 days, ranked by how well they match Genome Research's content profile, based on 409 papers previously published here. The average preprint has a 0.15% match score for this journal, so anything above that is already an above-average fit.
Liu, M.; Mamede, I.; Sofi, S.; Pereira, I.; Dostal, V.; Pashos, A. R. S.; McMahon, C.; Waikar, A.; Stephenson, G.; Cech, T. R.; Rinn, J. L.
Some long non-coding RNAs (lncRNAs) are known to regulate gene expression. However, the underlying temporal dynamics of lncRNAs influencing gene and epigenetic regulation and mechanisms of lncRNA regulation in trans are less understood. To investigate this, we genetically engineered 17 doxycycline-inducible lncRNA transgenes for ectopic expression at the H11 safe harbor locus in human pluripotent stem cells (hiPSCs), and we generated high-density temporal RNA-seq and ATAC-seq profiles. Most lncRNA transgenes were induced at 2 hours and maintained expression through the 96-hour time course. Surprisingly, when we sought to identify gene expression changes due to the lncRNAs, we found that the global transcriptional landscape was dominated by a strong systemic response triggered by doxycycline exposure. We rigorously defined this cohort of genes as a Doxycycline-Responsive Gene Signature (DRGS). The DRGS was also present in at least 28 public datasets from dox-inducible transgene studies involving diverse cell types. Next, we determined which lncRNAs exhibited trans-regulatory events. We identified DANCR, FENDRR, LINC00667, LINC00847, LNCPRESS1, and PNKY as lncRNAs that regulate specific transcript expression in trans. The downstream target genes encoded 53 mRNAs and 10 lncRNAs. None of the target lncRNAs altered gene expression proximal to their own loci (i.e., triggering secondary cis-effects). Surprisingly, the target genes of LINC00847 (transcribed from chromosome 22) were substantially enriched on chromosome 19, with a preponderance of target genes encoding RNA metabolism and RNA splicing factors. Collectively, our study provides a resource to discern artifacts in the doxycycline-inducible system and identifies temporally regulated targets of 6 lncRNAs for future mechanistic studies.
Brandulas Cammarata, A.; Fonseca Costa, S. S.; Rosikiewicz, M.; Roux, J.; Wollbrett, J.; Bastian, F. B.; Robinson-Rechavi, M.
RNA-Seq is a powerful technique to provide quantitative information on gene expression. While many applications focus on measuring expression levels, accurately distinguishing between actively and inactively transcribed genes is equally important for understanding gene function, development, and disease mechanisms. However, setting a biologically meaningful threshold for calling genes expressed is challenging due to variability in noise levels across different protocols, experiments or biological samples. We propose to define this threshold per sample relative to the background level observed in inactive genomic features, inferred by the amount of reads mapped to intergenic regions of the genome, and to call genes expressed if their level of expression is significantly higher than the estimated background noise. This approach can be applied to a single RNA-Seq library as well as to a combination of libraries from the same condition, in model and non-model organisms. We show that our method yields a more accurate prediction of expression state than existing methods, illustrated by consistent expression calls for biological replicates in the same tissue.
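The per-sample thresholding idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a normal background model fitted to log2(TPM + 1) values of intergenic regions and a one-sided test at a hypothetical `alpha`; the function name `call_expressed` and all parameters are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def call_expressed(gene_tpm, intergenic_tpm, alpha=0.05, pseudocount=1.0):
    """Call a gene expressed if it sits significantly above intergenic background.

    Assumption (not from the abstract): background noise is modelled as a
    normal distribution over log2(TPM + pseudocount) of intergenic regions.
    """
    background = [math.log2(v + pseudocount) for v in intergenic_tpm]
    null = NormalDist(mean(background), stdev(background))
    calls = {}
    for gene, tpm in gene_tpm.items():
        # One-sided p-value: probability that background alone reaches this level.
        p_value = 1.0 - null.cdf(math.log2(tpm + pseudocount))
        calls[gene] = p_value < alpha
    return calls


# Toy sample: the threshold is derived from this library's own intergenic signal.
intergenic = [0.0, 0.1, 0.2, 0.05, 0.3, 0.15]
genes = {"geneA": 50.0, "geneB": 0.1}
calls = call_expressed(genes, intergenic)
```

Because the background is estimated from each library separately, the same gene-level value can be called expressed in a clean library and not expressed in a noisy one.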
Mahajan, D.; Jain, C.; Kashyap, N.
Oxford Nanopore Technologies' adaptive sampling capability promises to reduce sequencing cost and turnaround time. At its core, adaptive sampling is a real-time classification problem that distinguishes reads originating from regions of interest. Direct signal-based classification approaches bypass the computational bottleneck of basecalling and can eliminate the need for powerful GPUs. However, operating directly on noisy raw signals remains challenging in real-time settings, where classification decisions must be made quickly. In this work, we propose NanoLabel, a new method for real-time classification of nanopore signals. We build NanoLabel on top of the signal-based read-mapping tool RawHash2. We accelerate the classification workflow by mapping reads using only the target regions as the reference. To further improve accuracy, we train a lightweight classifier on mapping-derived features and introduce a data augmentation strategy to construct sufficiently large and class-balanced training datasets. We evaluate NanoLabel using publicly available real sequencing datasets from three human genomes (HG001, HG002, and HG005), while assuming a cancer gene panel as the target. Compared to directly mapping reads with RawHash2, we demonstrate an 80x improvement in classification time and a 0.10-0.25 improvement in F1 score.
Niu, Z.; He, Y.; Galante, J.; Gschwind, A. R.; Ray, J.; Steinmetz, L. M.; Engreitz, J. M.; Katsevich, E.
CRISPR screens with single-cell RNA-seq readouts provide a powerful tool for characterizing the functions of noncoding elements and genes. However, designing these experiments to balance statistical power and cost is challenging, given the large number of design parameters. The only available tool for this purpose is a simulation-based power calculator, but it is computationally costly and requires high-performance computing to run. We derive a novel analytical formula for the power to detect perturbation-expression associations, recapitulating power estimates from the simulation-based tool while reducing runtime by up to seven orders of magnitude. This acceleration unlocks the possibility of interactive single-cell CRISPR screen design. Accordingly, we develop PerturbPlan, an interactive web application built on the analytical power formula. PerturbPlan helps users address 11 design questions for two types of single-cell CRISPR screens, Perturb-seq and targeted Perturb-seq (TAP-seq). We apply PerturbPlan to carry out a comparative analysis of three recent Perturb-seq designs, demonstrating how optimal design varies across experiments of different scales. We also use PerturbPlan to quantify the cost savings of a recent TAP-seq study relative to a hypothetical Perturb-seq study assaying the same perturbations, illustrating how the tool can inform decisions about targeted versus whole-transcriptome readouts. In sum, PerturbPlan is the first tool to facilitate flexible and interactive design of well-powered single-cell CRISPR screen experiments.
Prillo, S.; Rimini, D.; Olivares-Chauvet, P.; Song, Y. S.; Yosef, N.
Single-cell lineage tracing technologies are providing new and powerful ways to interrogate the evolution and divergence of cell populations in cancer, development, and other contexts. A key initial step in any such analysis is the grouping of cells into clonal populations, based on clone-level marks. Unfortunately, clone calling is prone to technical effects due to sequencing errors, missing data, multiplets, background noise, and accidental sharing of clonal barcodes between unrelated clones (homoplasy). We present NovaClone, a principled algorithm for hierarchical clone calling that is broadly applicable to all current tracing technologies, including both static barcoding and the more recent evolving tracers. We benchmark NovaClone on simulated and real data to show that it outperforms the current solutions in terms of both quality and speed, thereby helping to mitigate one of the most prevalent problems with single-cell lineage tracing. To complement NovaClone, we introduce a suite of algorithm-agnostic quality control metrics to evaluate clone calls when ground truth is not available. NovaClone and the associated QCs are available through the open source Python package nova-clone.
Myers, M. A.; Satas, G.; Shah, S.; Mcpherson, A.
Correctly inferring copy-number aberrations from single-cell DNA sequencing data requires estimating cellular DNA content, which is unidentifiable from read counts alone. In tagmentation-based sequencing, each fragment represents a distinct DNA molecule, thus fragment overlaps provide an orthogonal signal for copy number. We present a theoretical model of fragment overlaps as a function of copy number and coverage and introduce scPlOver, a method that uses this model to infer DNA content. scPlOver outperforms existing approaches on simulated and experimental datasets and identifies thousands of ovarian cancer cells with higher DNA content than previously estimated across a cohort of 41 patients.
Dagilis, A. J.; DiAngelis, B.; Lee, S.; Matute, D. R.
Co-evolution between genes can occur for a variety of reasons, including co-expression of genes, epistatic interactions between them, physical interactions of gene products and many others. Co-evolutionary partners of a gene are therefore of great interest in identifying potential factors that contribute to any phenotype of interest. State-of-the-art approaches to detect these interactions use correlations of evolutionary rates across a broader phylogeny, and so by necessity identify interactions only among genes that are present across long evolutionary time periods. This makes the methods unwieldy when interest lies in a single focal organism in which the genes of interest may have evolved in the recent evolutionary past. Here, we present a new approach to calculating evolutionary rate correlations which focuses on extracting maximum coverage for a single focal species, while retaining signals of co-evolution across large clades. We show how this approach is able to identify potential interactions even in highly studied species and highly studied genes, with a focus on the D. melanogaster sex-determiner, Sxl, using data from 72 species of Dipterans.
Ribes, R.; Mandier, C.; Baniel, A.
PCR duplicate removal is a critical first step in high-throughput sequencing pipelines, yet existing tools struggle with speed, memory, or correctness at modern dataset scales. We present FastDedup, a Rust-based FASTX deduplicator that transforms each read or read pair to a compact xxh3 hash fingerprint, drastically reducing memory usage and binding most of the execution time to disk I/O. Benchmarked against six competing tools on synthetic human WGS datasets up to 300 million reads, FastDedup consistently leads on paired-end data, running more than 10 times faster than fastp. It also outperforms all tools on uncompressed single-end data, deduplicating a million reads in a second. We additionally report correctness failures in prinseq++ and clumpify. FastDedup is available under the MIT License via GitHub, Bioconda, and Cargo.
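The fingerprint-based approach can be illustrated with a short sketch. It is not FastDedup's code: the tool is written in Rust and uses xxh3, whereas this example swaps in an 8-byte BLAKE2b digest from the Python standard library as a stand-in. The function names and the record layout (name followed by one or two sequences) are assumptions made for illustration.

```python
import hashlib


def fingerprint(*sequences):
    """Reduce a read (or read pair) to a fixed 8-byte fingerprint.

    Stand-in hash: BLAKE2b truncated to 64 bits, used here only because
    xxh3 is not in the standard library.
    """
    h = hashlib.blake2b(digest_size=8)
    for seq in sequences:
        h.update(seq.encode("ascii"))
        h.update(b"\x00")  # separator so ("AC", "GT") differs from ("A", "CGT")
    return h.digest()


def deduplicate(records):
    """Yield the first record seen for each distinct sequence fingerprint."""
    seen = set()  # holds 8-byte fingerprints, never the full sequences
    for name, *sequences in records:
        key = fingerprint(*sequences)
        if key not in seen:
            seen.add(key)
            yield (name, *sequences)


single_end = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTTT")]
paired_end = [("p1", "AC", "GT"), ("p2", "A", "CGT"), ("p3", "AC", "GT")]
kept_single = list(deduplicate(single_end))
kept_paired = list(deduplicate(paired_end))
```

Storing only fixed-size fingerprints keeps memory proportional to the number of distinct reads rather than to their length. The trade-off is that a hash collision would silently discard a non-duplicate read, which is rare but possible with a 64-bit digest.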
Furutani, T.; Ji, H.
While multimodal sequencing technologies are rapidly advancing, most single-cell and spatial datasets still measure only a single modality. Integrative computational methods for separately profiled single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data typically rely on the assumption that gene expression correlates with the chromatin accessibility of nearby regulatory regions. However, the strength and reliability of these correlations vary substantially across genes, and incorporating low-confidence associations can compromise integration accuracy. Here, we introduce the CLIC (Cross-modality Link Confidence) score, a quantitative measure of the empirical concordance between gene expression and nearby chromatin accessibility, derived from diverse single-cell multiome datasets from the ENCODE project. CLIC scores provide prior confidence estimates for gene-peak associations across modalities. Building on this, we propose a hybrid feature selection strategy that intersects highly variable genes with high-CLIC genes, generating feature sets that better align with the assumptions of cross-modal integration methods. Across diverse publicly available single-cell and spatial datasets, and multiple state-of-the-art integration frameworks, our approach consistently improves the integration of gene expression and chromatin accessibility data, enhancing both robustness and biological interpretability.
Xu, Z.; Wang, K.
Allele-specific analysis from RNA-seq is a powerful approach to characterize cis-regulatory effects. However, existing methods remain limited in both haplotype inference and allelic testing. Their haplotype-inference workflows separate variant calling, haplotype phasing, and read-haplotype assignment into sequential steps, failing to fully exploit within-read SNV linkage information and propagating errors into downstream allelic analysis. At the testing stage, they ignore non-phasable reads lacking heterozygous SNVs, biasing calls and inflating false positives, and remain incomplete across gene-, isoform-, and local-event-level variant effects. Here, we present LongAllele, a statistical framework that employs an expectation-maximization algorithm to jointly infer heterozygous variants, haplotype structure, and read-haplotype assignments from long-read bulk and single-cell RNA sequencing. LongAllele further introduces phasability-aware testing that explicitly accounts for non-phasable reads, avoiding inflated false-positive calls when haplotype information is incomplete. It also enables comprehensive allelic testing across gene-level ASE, isoform-level allele-specific transcript usage (ASTU), and local-event-level haplotype-associated exon and junction usage (HAEU and HAJU), providing a multi-scale view of cis-regulation. We applied LongAllele to long-read RNA-seq datasets spanning GTEx (multi-tissue bulk), peripheral blood mononuclear cells (single-cell), and human hippocampus (single-nucleus). LongAllele consistently revealed greater tissue and cell-type variability in expression-level than isoform-level allelic regulation, pinpointed high-impact regulatory variants including rare splice-site mutations missed by standalone variant callers, and showed that purifying selection constrains allelic imbalance at both gene and isoform levels. LongAllele offers a unified framework for haplotype-resolved cis-regulatory analysis across diverse cellular contexts.
Karbalayghareh, A.; Paull, E.; Califano, A.
Learning causal gene regulatory mechanisms from single-cell data, and thereby predicting the effects of unseen perturbations, remains challenging. Observational RNA-seq data alone is insufficient for causal modeling, whereas perturbational data is essential. Classical causal inference methods often rely on unrealistic directed acyclic graph (DAG) assumptions and are not well suited to integrating multimodal data. Current transcriptomic foundation models also typically treat observational and perturbational data identically, limiting their ability to model perturbations. We present DoFormer, a causal multimodal Transformer that makes no DAG assumptions and leverages rich perturbational data to accurately predict previously unseen perturbations. DoFormer enables principled in silico perturbations by adapting the causal do-operator within the attention mechanism: the perturbed gene is set to the intervention value and prevented from attending to other genes, allowing the model to fully distinguish observational from interventional regimes. We train DoFormer using biologically informed loss functions and evaluate it with comprehensive perturbation prediction metrics. DoFormer substantially improves perturbation prediction relative to baseline and prior foundation models, underscoring the importance of intervention-aware architectures and biologically grounded objectives for causal modeling in single-cell genomics.
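The do-operator adaptation described above (set the perturbed gene to its intervention value and stop it from attending to other genes) can be sketched with a plain attention mask. This is an illustrative toy in pure Python, not DoFormer's architecture: the helper names are hypothetical, and letting a perturbed gene still attend to itself is an assumption made so that its attention row stays well defined.

```python
import math


def apply_intervention(expression, interventions):
    """Overwrite perturbed genes with their intervention values (the 'do' step)."""
    return [interventions.get(i, x) for i, x in enumerate(expression)]


def do_attention_mask(n_genes, perturbed):
    """allowed[i][j] is True when gene i (query) may attend to gene j (key).

    Assumption: a perturbed gene attends only to itself, so its
    representation cannot be influenced by any other gene.
    """
    return [[(i not in perturbed) or (i == j) for j in range(n_genes)]
            for i in range(n_genes)]


def masked_softmax(scores, allowed):
    """Row-wise softmax that assigns zero weight to disallowed positions."""
    weights = []
    for row, ok in zip(scores, allowed):
        top = max(s for s, a in zip(row, ok) if a)  # for numerical stability
        exps = [math.exp(s - top) if a else 0.0 for s, a in zip(row, ok)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


expression = apply_intervention([5.0, 3.0, 1.0], {1: 0.0})  # knock out gene 1
mask = do_attention_mask(3, perturbed={1})
weights = masked_softmax([[0.0, 1.0, 2.0]] * 3, mask)
```

Unperturbed genes can still attend to the perturbed gene, so the intervention propagates downstream while nothing flows back into the intervened node, mirroring how the do-operator cuts incoming edges.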
Das, R.; Dey, A.; Maulik, U.; Bandyopadhyay, S.
Clustering plays a critical role in the analysis of single-cell omics data for identifying cellular heterogeneity and uncovering biological mechanisms. However, the high dimensionality, sparsity, and multimodal nature of single-cell datasets such as single-cell RNA sequencing (scRNA-seq) and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) pose significant challenges for effective feature learning and representation learning. Traditional dimensionality reduction methods often rely on linear transformations and fail to capture complex nonlinear relationships between gene and protein expression profiles. In this work, we propose SRSA-VAE, a scalable variational autoencoder framework that integrates a residual self-attention encoder for context-aware feature learning and multimodal representation learning. The proposed model dynamically contextualizes gene and protein representations through a self-attention mechanism, enabling the encoder to capture inter-cell relationships and emphasize biologically informative signals. A scalable residual connection further stabilizes training and preserves essential input information during latent representation learning. We evaluate SRSA-VAE on five large-scale publicly available single-cell datasets, including both scRNA-seq and CITE-seq data, and compare its performance with established deep generative models. Experimental results demonstrate that SRSA-VAE consistently outperforms existing methods in Adjusted Rand Index (ARI) across benchmark datasets, with particularly strong gains on complex immune cell populations. Ablation studies further confirm the importance of the self-attention mechanism and residual connection in enhancing model stability and clustering accuracy. The proposed model offers a generalizable, robust, and scalable solution for single-cell clustering tasks. Code Repository: https://github.com/rangan2510/srsa-vae
Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.
Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of the Message Passing Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.
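The two headline numbers are related by simple arithmetic, sketched below. The helper name `time_reduction` is illustrative, and the calculation assumes both figures refer to the same baseline run, which the abstract does not state explicitly.

```python
def time_reduction(speedup):
    """Fraction of wall-clock time saved when a job runs `speedup` times faster."""
    return 1.0 - 1.0 / speedup


# A 58.2x speedup means the job takes 1/58.2 of the original time.
reduction = time_reduction(58.2)
print(f"{reduction:.1%}")  # 98.3%
```

Under that assumption, the "over 98% reduction" figure is what a 58.2x speedup implies rather than an independent result.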
A.M., V.; Zhang, Q.; Srivastava, S.; Koronowski, K. B.; Srivastava, A.
The circadian clock genes Bmal1 and Nr1d1/2 (REV-ERBα/β) regulate skeletal muscle metabolism and homeostasis, yet the precise genes and mechanisms involved remain incompletely understood. Here, we perform Weighted Gene Co-expression Network Analysis (WGCNA) on skeletal muscle circadian transcriptomes with varying Bmal1 operational status to identify genes central to muscle circadian function. The largest WGCNA module, potentially under Bmal1 regulation, contains clock and muscle-specific output genes governed hierarchically by hub genes including Igf2bp2, an RNA-binding protein involved in muscle progenitor growth and maintenance. Igf2bp2 expression is rhythmic in mouse and human muscle, and functional experiments in muscle-specific Bmal1 knockout mice show that Igf2bp2 is upregulated by loss of Bmal1 at ZT8 and negatively correlated with Nr1d2, suggesting de-repression through REV-ERBβ as a regulatory mechanism. Luciferase reporter experiments in cultured myotubes show that REV-ERBβ, but not REV-ERBα, represses Igf2bp2 transcription and that repression is mediated by non-canonical GCC motifs in the Igf2bp2 promoter region. Together, these findings uncover a circadian Nr1d2-Igf2bp2 regulatory axis linking transcriptional and post-transcriptional regulation in skeletal muscle, with implications for muscle homeostasis.
Highlights: Igf2bp2 clusters with Nr1d2 (Rev-erbβ) in circadian co-expression network; Bmal1 or Rev-erbα/β knockout upregulates Igf2bp2 in muscle; Igf2bp2 is rhythmic in WT muscle but arrhythmic in clock mutant muscle; REV-ERBβ represses Igf2bp2 transcription in myotubes; REV-ERBβ repression requires GCC motifs in the Igf2bp2 promoter.
Oubninte, S.; Ruczinski, I.; Yanek, L. R.; Mathias, R.; Bureau, A.
Few studies have assessed the performance of population-based phasing combined with parental genotypes to infer recombination on whole genome sequence (WGS) data. In this study, our objective was to evaluate whether Shapeit2 duoHMM, a Hidden Markov Model using parental genotypes, infers recombination events reliably on WGS and with narrower intervals than on SNP arrays. We based our analysis on the overlap between recombination events inferred by Merlin on SNP genotypes and by Shapeit2 on WGS and SNP genotypes. We used a sample of 61 extended families from the GeneSTAR study with TopMED freeze 8 WGS on 580 sequenced subjects (60% of sample). Shapeit2 was run with a window size of 500 kilobases and 200 states on WGS. To mimic a SNP array, we extracted genotypes of 355,112 autosomal markers on the Illumina OmniExpress array. The number of recombination events per meiosis inferred by Shapeit2 on the WGS data (36.8) was close to the expected number over autosomes (35.7), whereas Merlin overestimated this number (115.0). 73% of Shapeit2 recombination events on WGS were detected by Merlin, a proportion rising to 91% when restricting to events also inferred by Shapeit2 on OmniExpress genotypes. Furthermore, Shapeit2 recombination intervals were narrower on WGS than on OmniExpress genotypes (median of 4,530 bp vs. 49,458 bp). This suggests that Shapeit2 on WGS is a reliable and accurate method for inferring recombination events.
Queme, B.; Marjoram, P.; Mi, H.
Over-representation analysis (ORA) is the most commonly used interpretation tool for gene lists despite well-documented limitations: pathway boundaries are fixed, genes are assumed independent, and results depend on the background set. Network-based methods address these using interaction-network modularity, but introduce hub bias: highly connected genes appear clustered under naive nulls because curated networks overrepresent well-studied genes. Existing corrections are imperfect: edge permutation destroys the topology the test should condition on, and propagation methods hide the confound in parameter tuning. We introduce MANGO (Moran's Autocorrelation for Network Gene Over-representation), which asks one conditional question: does a gene set's spatial autocorrelation on a fixed biological network exceed what its degree composition alone would predict? MANGO computes Global Moran's I under a null that conditions on both the network and the binned degree distribution of the gene set, then decomposes significant signals at the component and gene level. In benchmarks, uniform nulls produce a false positive rate of 1.0 on hub-enriched gene sets with no real clustering; ten-bin degree-stratified nulls bring that to 0.0 with no power loss (AUC ≥ 0.98; on degree-typical signals, |ΔAUC| ≤ 0.004). Pathway-spiking simulations confirm detection of real biological clustering across diverse pathway sizes and degree profiles. Applied to the FIGI colorectal cancer GWAS (204 SNPs), the set is degree-typical (KS p = 0.83), yet Moran's I is highly significant (p < 0.001). Component-level jackknife localizes the entire signal to a single 24-gene module spanning TGF-β, Wnt/cadherin, and related pathways, with four bottlenecks (SMAD3, MYC, CTNNB1, PTPN1) matching established CRC driver biology.
eTOC blurb: MANGO tests whether a gene set's spatial autocorrelation on a biological network exceeds what its degree composition predicts, by conditioning Global Moran's I on the binned degree distribution with the network held fixed. Significant signals are decomposed to modules, bottleneck genes, and statistical drivers through component jackknife, articulation-point, and gene-jackknife analysis.
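The conditional test described above can be sketched in a few lines. This is a minimal illustration, not the MANGO implementation: it assumes an unweighted, symmetric network stored as a dict from node index (0..n-1) to a set of neighbours, equal-size degree bins, and a one-sided permutation p-value. All function names and defaults are hypothetical.

```python
import math
import random


def morans_i(adj, x):
    """Global Moran's I of values x over an unweighted network (needs >= 1 edge)."""
    n = len(x)
    xbar = sum(x) / n
    dev = [v - xbar for v in x]
    denom = sum(d * d for d in dev)
    num, total_weight = 0.0, 0
    for i, neighbours in adj.items():
        for j in neighbours:  # each undirected edge is visited in both directions
            num += dev[i] * dev[j]
            total_weight += 1
    return (n / total_weight) * num / denom


def degree_bins(adj, n_bins):
    """Split nodes into roughly equal-size bins after sorting by degree."""
    order = sorted(adj, key=lambda node: len(adj[node]))
    size = math.ceil(len(order) / n_bins)
    return [order[k:k + size] for k in range(0, len(order), size)]


def stratified_p_value(adj, members, n_bins=10, n_perm=999, seed=0):
    """Permutation p-value with labels shuffled only within degree bins.

    The network stays fixed and every null set has the same number of
    members per degree bin as the observed gene set.
    """
    rng = random.Random(seed)
    x = [1.0 if i in members else 0.0 for i in range(len(adj))]
    observed = morans_i(adj, x)
    bins = degree_bins(adj, n_bins)
    hits = 0
    for _ in range(n_perm):
        perm = x[:]
        for nodes in bins:
            labels = [x[i] for i in nodes]
            rng.shuffle(labels)
            for i, label in zip(nodes, labels):
                perm[i] = label
        if morans_i(adj, perm) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)


# Toy path graph 0-1-2-3 with the gene set on one end.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
observed, p_value = stratified_p_value(path, {0, 1}, n_bins=2, n_perm=99, seed=1)
```

Shuffling within degree bins is what separates this from a uniform null: a hub-heavy gene set is compared only against other hub-heavy sets, so high connectivity alone cannot produce a significant score.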
Zhang, Y.; Han, M.; Ambalavanan, A.; Topouza, D.; Fang, Z. Y.; Stickley, S. A.; Anand, S.; Turvey, S.; Mandhane, P. J.; Simons, E.; Moraes, T. J.; Subbarao, P.; Choi, J.; Duan, Q.
Although genome-wide association studies (GWASs) have been widely applied to investigate the genetic basis of common traits and diseases in human populations, the associated loci do not fully account for the estimated heritability. The missing heritability may be explained, in part, by epistasis or gene-gene interactions. Existing methods for detecting epistasis, however, are limited to pair-wise interactions and/or targeted genomic regions. Here, we present a novel model, termed the Epistatic SNP Network Analysis (ESNA), which detects higher-order epistatic interactions using genome-wide SNP data. ESNA employs a scale-free network algorithm within a parallel computing framework that identifies modules of correlated SNPs, potentially interacting variants that converge on common biological pathways, while enhancing computational efficiency. We applied ESNA to investigate epistatic interactions contributing to respiratory outcomes such as recurrent wheeze and asthma among preschool-aged children in the CHILD Cohort Study. Using genome-wide data comprising 775,569 SNPs from 1,899 children, ESNA identified 914 SNP network modules, 9 of which were significantly associated with recurrent wheeze between ages 2 and 5 years (P < 5.47 × 10^-5). Furthermore, 7 of these wheeze-associated modules were also associated with asthma by age 5 years (P < 5.47 × 10^-5). Pathway enrichment analysis revealed that the associated modules consist of SNPs located in genes previously implicated in asthma and related biological processes, such as cellular response to stimuli and nervous system development. Compared to existing network-based methods for epistasis, ESNA demonstrated substantial improvements in computational efficiency, reducing memory usage by 50% and processing genome-wide SNP data 48 times faster. The code implementation and documentation are available at https://github.com/ComputationalGenomicsLaboratory/ESNA.
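The significance threshold quoted above matches what a Bonferroni correction over the 914 modules would give at a family-wise level of 0.05. The abstract does not name the correction, so treat that reading as an assumption; the helper name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Assumed inputs: family-wise alpha of 0.05 spread over the 914 SNP modules.
threshold = bonferroni_threshold(0.05, 914)
print(f"{threshold:.2e}")  # 5.47e-05
```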
Long, W.; Hou, Y.; Zhang, Y.
Motivation: Reliable topologically associating domain (TAD) calling from Hi-C contact maps remains difficult at high resolution and realistic sequencing depth. A central reason is that many callers learn boundary evidence largely from local signals, while domain compatibility is handled mainly during downstream decoding, so the learned boundary scores are not explicitly optimized for the TAD assembly step that ultimately determines the final calls. Results: We present ContextTAD, a deep-learning TAD caller that learns boundary evidence from broader local Hi-C windows that capture TAD-scale structural context. Instead of treating boundary prediction as an isolated per-bin classification problem, ContextTAD uses a context-aware representation to produce left- and right-boundary tracks that are explicitly optimized for downstream TAD assembly. Concretely, the model combines multiscale feature extraction from 2D Hi-C windows with a pair objective that rewards compatible boundary combinations and a count objective that regularizes window-level boundary evidence. Due to the limited availability of high-quality TAD annotations, supervised deep-learning methods for TAD calling remain rare. To address this bottleneck, we construct improved training annotations by integrating high-coverage Hi-C structure with complementary boundary-associated genomic signals, thereby providing more reliable supervision for model training. We benchmarked ContextTAD against a broad panel of alternative TAD callers across standard comparative evaluation, sequencing-depth robustness analysis, and cross-cell-type transfer settings, and found that it performed strongly against alternative tools across this wide range of settings, with the best overall recovery of biologically supported TADs. Availability: https://github.com/ai4nucleome/ContextTAD Contact: yanlinzhang@hkust-gz.edu.cn
Thiel, M.; Barnes, C. P.
Generative DNA models are typically next-token completers: they extend a sequence but offer no native interface for telling the model what to make. PlasmidLM is a promptable DNA language model for plasmids. A designer supplies a human-readable component specification, for example a high-copy E. coli vector with kanamycin resistance and an EGFP reporter, and the model generates the corresponding multi-kilobase construct in a single autoregressive pass. Prompts are unordered sets of named-part tokens at the granularity of biological shorthand, not learned latent codes or rigid grammars. We evaluate outputs along two axes: a sequence is viable if structurally plausible as a plasmid, and faithful if its detected components match the prompt. Their conjunction is the useful-plasmid rate, the primary metric we report. On a held-out 1,000-prompt benchmark, the post-trained model achieves a useful-plasmid rate of 48.5% at single-shot decoding and 89.7% under best-of-4 sampling. Verifiable-reward post-training with GRPO against a 660-entry sequence motif registry improves the useful-plasmid rate across all sampling budgets. We release the 19.3M-parameter model, evaluation suite, and a paired benchmark of prompt-sequence pairs.
Wang, D.; Qin, F.; Bao, W.; Bacher, R.; Chung, D.; Lu, Q.; Efron, P. A.; Cai, G.; Xiao, F.
Copy number variations (CNVs) are major structural genomic variants that contribute to a wide range of human diseases. Accurate detection of CNVs from whole-exome sequencing (WES) data has been a long-sought goal for clinical and population genetic studies. Despite recent progress, existing WES-based CNV callers still suffer from high false-positive rates and reduced recall for short-length variants, and current deep learning methods have not fully used complementary information in region-level genomic features. Here we present CN-RNN, a deep learning-based CNV caller for WES data. The model combines a bidirectional long short-term memory (BiLSTM) branch that captures local depth changes and contextual dependencies across neighboring exons with a parallel multi-layer perceptron (MLP) branch that encodes region-level metadata such as GC content, mappability, and exon length. CN-RNN was trained on the Autism Sequencing Consortium (ASC) parent-child trio cohort using the Mendelian rule of inheritance to ensure high-quality training sets. It was evaluated across three independent datasets, in which we showed that CN-RNN outperformed existing WES-based CNV callers and deep learning methods. CN-RNN offers a scalable, accurate tool for CNV profiling in WES-based studies and supports broader application of CNV analysis in population and clinical research. CN-RNN is available at https://github.com/FeifeiXiao-lab/CN-RNN.